Programming Windows Azure : Understanding the Value of Queues

11/21/2010 11:42:07 AM

Think about your last trip to your local, high-volume coffee shop. You walk up to a counter and place an order by paying for your drink. The barista marks a cup with your order and places it in a queue, where someone else picks it up and makes your drink. You head off to wait in a different line (or queue) while your latte is prepared.

This common, everyday exchange provides several interesting lessons:

The coffee shop can increase the number of employees at the counter, or increase the number of employees making coffee, with neither action depending on the other, but depending on the load.
The barista making coffee can take a quick break and customers don’t have to wait for someone to take their orders. The system keeps moving, but at a slower pace.
The barista is free to pick up different cups, and to process more complex orders first, or process multiple orders at once.

This is a significant amount of flexibility, something that a lot of software systems can be envious of. This is also a classic example of a good queuing system.

There are several good uses for queues, but most people wind up using queues to implement one important component of any scalable website or service: asynchronous operations. Let’s illustrate this with an example most of you should be familiar with: Twitter.

Consider how Twitter works. You publish an update and it shows up in several places: your time line, the public time line, and for all the people following you. When building a prototype of Twitter, it might make sense to implement it in the simplest way possible—when you select the Update button (or submit it through the API) you publish changes into all the aforementioned places. However, this wouldn’t scale.

When the site is under stress, you don’t want people to lose their updates because you couldn’t update the public timeline in a timely manner. However, if you take a step back, you realize that you don’t need to update all these places at the exact same time. If Twitter had to prioritize, it would pick updating that specific user’s time line first, and then feed to and from that user’s friends, and then the public time line. This is exactly the kind of scenario in which queues can come in handy.

Using queues, Twitter can queue updates, and then publish them to the various time lines after a short delay. This helps the user experience (your update is immediately accepted), as well as the overall system performance (queues can throttle publishing when Twitter is under load).

These problems face any application running on Windows Azure. Queues in Windows Azure allow you to decouple the different parts of your application, as shown in Figure 1.

Figure 1. Decoupling applications with queues

In this model, a set of frontend roles process incoming requests from users (think of the user typing in Twitter updates). The frontend roles queue work requests to a set of Windows Azure queues. The backend roles implement some logic to process these work requests. They take items off the queue and do all the processing required. Since only small messages can be put on the queue, the frontends and backends share data using the blob or the table storage services in Windows Azure.

Video sites are a canonical, or commonly demonstrated, usage of Windows Azure queues. After you upload your videos to sites such as YouTube and Vimeo, the sites transcode your videos into a format suitable for viewing. The sites must also update their search indexes, listings, and various other parts. Accomplishing this when the user uploads the video would be impossible, because some of these tasks (especially transcoding from the user’s format into H.264 or one of the other web-friendly video formats) can take a long time.

The typical way websites solve this problem is to have a bunch of worker process nodes keep reading requests off queues to pick videos to transcode. The actual video files can be quite large, so they’re stored in blob storage, and deleted once the worker process has finished transcoding them (at which point, they’re replaced by the transcoded versions). Since the actual work done by the frontends is quite small, the site could come under heavy upload and not suffer any noticeable performance drop, apart from videos taking longer to finish transcoding.

Since Windows Azure queues are available over public HTTP, like the rest of Windows Azure’s storage services, your code doesn’t have to run on Windows Azure to access them. This is a great way to make your code on Windows Azure and your code on your own hardware interoperate. Or you could use your own hardware for all your code, and just use Windows Azure queues as an endless, scalable queuing mechanism.

Let’s break down some sample scenarios in which websites might want to use Windows Azure queues.

1. Decoupling Components

Bad things happen in services. In large services, bad things happen with alarming regularity. Machines fail. Software fails. Software gets stuck in an endless loop, or crashes and restarts, only to crash again. Talk to any developer running a large service and he will have some horror stories to tell. A good way to make your service resilient in the face of these failures is to make your components decoupled from each other.

When you split your application into several components hooked together using queues, failure in one component need not necessarily affect failure in another component. For example, consider the trivial case where a frontend server calls a backend server to carry out some operation. If the backend server crashes, the frontend server is going to hang or crash as well. At a minimum, it is going to degrade the user experience.

Now consider the same architecture, but with the frontend and backend decoupled using queues. In this case, the backend could crash, but the frontend doesn’t care—it merrily goes on its way submitting jobs to the queue. Since the queue’s size is virtually infinite for all practical purposes, there is no risk of the queue “filling up” or running out of disk space.

What happens if the backend crashes while it is processing a work item? It turns out that this is OK, too. Windows Azure queues have a mechanism in which items are taken off the queue only once you indicate that you’re done processing them. If you crash after taking an item, but before you’re done processing it, the item will automatically show up in the queue after some time. You’ll learn about the details of how this mechanism works later.

Apart from tolerating failures and crashes, decoupling components gives you a ton of additional flexibility:

It lets you deploy, upgrade, and maintain one component independently of the others. Windows Azure itself is built in such a componentized manner, and individual components are deployed and maintained separately. For example, a new release of the fabric controller is dealt with separately from a new release of the storage services. This is a boon when dealing with bug fixes and new releases, and it reduces the risk of new deployments breaking your service. You can take down one part of your service, allow requests to that service to pile up in Windows Azure queues, and work on upgrading that component without any overall system downtime.
It provides you with implementation flexibility. For example, you can implement the various parts of your service in different languages or by using different toolkits, and have them interoperate using queues.
Note:
Of course, this flexibility isn’t unique to queues. You can get the same effect by having standardized interfaces on well-understood protocols. Queues just happen to make some of the implementation work easier.

2. Scaling Out

Think back to all the time you’ve spent in an airport security line. There’s a long line of people standing in front of a few security gates. Depending on the time of day and overall airport conditions, the number of security gates actually manned might go up and down. When the line (queue) becomes longer and longer, the airport could bring more personnel in to man a few more security gates. With the load now distributed, the queue (line) moves much quicker. The same scenario plays out in your local supermarket. The bigger the crowd is, the greater the number of manned checkout counters.

You can implement the same mechanism inside your services. When load increases, add to the number of frontends and worker processes independently. You can also flexibly allocate resources by monitoring the length and nature of your queues. For example, high-priority items could be placed in a separate queue. Work items that require a large amount of resources could be placed in their own queue, and then be picked up by a dedicated set of worker nodes. The variations are endless, and unique to your application.

3. Load Leveling

Load leveling is similar to scaling out. Load and stress for a system vary constantly over time. If you provision for peak load, you are wasting resources, since that peak load may show up only rarely. Also, having several virtual machines in the cloud spun up and waiting for load that isn’t there goes against the cloud philosophy of paying only for what you use. You also need to worry about how quickly the load increases—your system may be capable of handling higher load, but not a very quick increase.

Using queues, you can queue up all excess load and maintain your desired level of responsiveness. Since you’re monitoring your queues and increasing your virtual machines (or other resources when running outside the cloud) on demand, you also don’t have to worry about designing your system to always run at peak scale.

A good production example of a system that uses all of these principles is SkyNet from SmugMug.com. SmugMug is a premier photo-sharing site hosted on Amazon Web Services. Though it uses Amazon’s stack, all of its principles and techniques are applicable to Windows Azure as well. It uses a bunch of queues into which the frontends upload jobs. Its system (codenamed SkyNet, after the AI entity in the Terminator movies) monitors the load on the job queues, and automatically spins up and spins down instances as needed.

Note:

You can find details on SkyNet at http://blogs.smugmug.com/don/2008/06/03/skynet-lives-aka-ec2-smugmug/.